test item
Ontologies in Motion: A BFO-Based Approach to Knowledge Graph Construction for Motor Performance Research Data in Sports Science
Ondraszek, Sarah Rebecca, Waitelonis, Jörg, Keller, Katja, Niessner, Claudia, Jacyszyn, Anna M., Sack, Harald
An essential component for evaluating and comparing physical and cognitive capabilities between populations is the testing of various factors related to human performance. As a core part of sports science research, testing motor performance enables the analysis of the physical health of different demographic groups and makes them comparable. The Motor Research (MO|RE) data repository, developed at the Karlsruhe Institute of Technology, is an infrastructure for publishing and archiving research data in sports science, particularly in the field of motor performance research. In this paper, we present our vision for creating a knowledge graph from MO|RE data. With an ontology rooted in the Basic Formal Ontology, our approach centers on formally representing the interrelation of plan specifications, specific processes, and related measurements. Our goal is to transform how motor performance data are modeled and shared across studies, making it standardized and machine-understandable. The idea presented here is developed within the Leibniz Science Campus ``Digital Transformation of Research'' (DiTraRe).
- Europe > Germany > Baden-Württemberg > Karlsruhe Region > Karlsruhe (0.26)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Europe > Portugal > Lisbon > Lisbon (0.04)
- Asia > Japan (0.04)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine > Consumer Health (1.00)
The fragility of "cultural tendencies" in LLMs
In a recent study, Lu, Song, and Zhang (2025) (LSZ) propose that large language models (LLMs), when prompted in different languages, display culturally specific tendencies. They report that the two models (i.e., GPT and ERNIE) respond in more interdependent and holistic ways when prompted in Chinese, and more independent and analytic ways when prompted in English. LSZ attribute these differences to deep-seated cultural patterns in the models, claiming that prompt language alone can induce substantial cultural shifts. While we acknowledge the empirical patterns they observed, we find their experiments, methods, and interpretations problematic. In this paper, we critically re-evaluate the methodology, theoretical framing, and conclusions of LSZ. We argue that the reported "cultural tendencies" are not stable traits but fragile artifacts of specific models and task design. To test this, we conducted targeted replications using a broader set of LLMs and a larger number of test items. Our results show that prompt language has minimal effect on outputs, challenging LSZ's claim that these models encode grounded cultural beliefs.
- Europe > Germany > Baden-Württemberg > Stuttgart Region > Stuttgart (0.04)
- Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.04)
- Asia > Japan (0.04)
- Asia > China > Shanghai > Shanghai (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study > Negative Result (0.68)
Leveraging AI Graders for Missing Score Imputation to Achieve Accurate Ability Estimation in Constructed-Response Tests
Evaluating the abilities of learners is a fundamental objective in the field of education. In particular, there is an increasing need to assess higher-order abilities such as expressive skills and logical thinking. Constructed-response tests such as short-answer and essay-based questions have become widely used as a method to meet this demand. Although these tests are effective, they require substantial manual grading, making them both labor-intensive and costly. Item response theory (IRT) provides a promising solution by enabling the estimation of ability from incomplete score data, where human raters grade only a subset of answers provided by learners across multiple test items. However, the accuracy of ability estimation declines as the proportion of missing scores increases. Although data augmentation techniques for imputing missing scores have been explored in order to address this limitation, they often struggle with inaccuracy for sparse or heterogeneous data. To overcome these challenges, this study proposes a novel method for imputing missing scores by leveraging automated scoring technologies for accurate IRT-based ability estimation. The proposed method achieves high accuracy in ability estimation while markedly reducing manual grading workload.
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
- North America > United States > New York > New York County > New York City (0.04)
- North America > Montserrat (0.04)
- (2 more...)
- Research Report > New Finding (0.67)
- Research Report > Promising Solution (0.54)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- (2 more...)
Do LLMs Give Psychometrically Plausible Responses in Educational Assessments?
Säuberli, Andreas, Frassinelli, Diego, Plank, Barbara
Knowing how test takers answer items in educational assessments is essential for test development, to evaluate item quality, and to improve test validity. However, this process usually requires extensive pilot studies with human participants. If large language models (LLMs) exhibit human-like response behavior to test items, this could open up the possibility of using them as pilot participants to accelerate test development. In this paper, we evaluate the human-likeness or psychometric plausibility of responses from 18 instruction-tuned LLMs with two publicly available datasets of multiple-choice test items across three subjects: reading, U.S. history, and economics. Our methodology builds on two theoretical frameworks from psychometrics which are commonly used in educational assessment, classical test theory and item response theory. The results show that while larger models are excessively confident, their response distributions can be more human-like when calibrated with temperature scaling. In addition, we find that LLMs tend to correlate better with humans in reading comprehension items compared to other subjects. However, the correlations are not very strong overall, indicating that LLMs should not be used for piloting educational assessments in a zero-shot setting.
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- (4 more...)
DeCo: Defect-Aware Modeling with Contrasting Matching for Optimizing Task Assignment in Online IC Testing
Ting, Lo Pang-Yun, Chiang, Yu-Hao, Tsai, Yi-Tung, Lai, Hsu-Chao, Chuang, Kun-Ta
In the semiconductor industry, integrated circuit (IC) processes play a vital role, as the rising complexity and market expectations necessitate improvements in yield. Identifying IC defects and assigning IC testing tasks to the right engineers improves efficiency and reduces losses. While current studies emphasize fault localization or defect classification, they overlook the integration of defect characteristics, historical failures, and the insights from engineer expertise, which restrains their effectiveness in improving IC handling. To leverage AI for these challenges, we propose DeCo, an innovative approach for optimizing task assignment in IC testing. DeCo constructs a novel defect-aware graph from IC testing reports, capturing co-failure relationships to enhance defect differentiation, even with scarce defect data. Additionally, it formulates defect-aware representations for engineers and tasks, reinforced by local and global structure modeling on the defect-aware graph. Finally, a contrasting-based assignment mechanism pairs testing tasks with QA engineers by considering their skill level and current workload, thus promoting an equitable and efficient job dispatch. Experiments on a real-world dataset demonstrate that DeCo achieves the highest task-handling success rates in different scenarios, exceeding 80\%, while also maintaining balanced workloads on both scarce or expanded defect data. Moreover, case studies reveal that DeCo can assign tasks to potentially capable engineers, even for their unfamiliar defects, highlighting its potential as an AI-driven solution for the real-world IC failure analysis and task handling.
- North America > United States (0.04)
- Asia > Taiwan (0.04)
- Research Report > Promising Solution (0.34)
- Overview > Innovation (0.34)
- Semiconductors & Electronics (1.00)
- Information Technology > Hardware (0.49)
- Information Technology > Data Science > Data Mining (0.93)
- Information Technology > Artificial Intelligence > Natural Language (0.68)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)
CHART-6: Human-Centered Evaluation of Data Visualization Understanding in Vision-Language Models
Verma, Arnav, Mukherjee, Kushin, Potts, Christopher, Kreiss, Elisa, Fan, Judith E.
Data visualizations are powerful tools for communicating patterns in quantitative data. Yet understanding any data visualization is no small feat -- succeeding requires jointly making sense of visual, numerical, and linguistic inputs arranged in a conventionalized format one has previously learned to parse. Recently developed vision-language models are, in principle, promising candidates for developing computational models of these cognitive operations. However, it is currently unclear to what degree these models emulate human behavior on tasks that involve reasoning about data visualizations. This gap reflects limitations in prior work that has evaluated data visualization understanding in artificial systems using measures that differ from those typically used to assess these abilities in humans. Here we evaluated eight vision-language models on six data visualization literacy assessments designed for humans and compared model responses to those of human participants. We found that these models performed worse than human participants on average, and this performance gap persisted even when using relatively lenient criteria to assess model performance. Moreover, while relative performance across items was somewhat correlated between models and humans, all models produced patterns of errors that were reliably distinct from those produced by human participants. Taken together, these findings suggest significant opportunities for further development of artificial systems that might serve as useful models of how humans reason about data visualizations. All code and data needed to reproduce these results are available at: https://osf.io/e25mu/?view_only=399daff5a14d4b16b09473cf19043f18.
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- (2 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
Efficacy of a Computer Tutor that Models Expert Human Tutors
Olney, Andrew M., D'Mello, Sidney K., Person, Natalie, Cade, Whitney, Hays, Patrick, Dempsey, Claire W., Lehman, Blair, Williams, Betsy, Graesser, Art
Tutoring is highly effective for promoting learning. However, the contribution of expertise to tutoring effectiveness is unclear and continues to be debated. We conducted a 9-week learning efficacy study of an intelligent tutoring system (ITS) for biology modeled on expert human tutors with two control conditions: human tutors who were experts in the domain but not in tutoring and a no-tutoring condition. All conditions were supplemental to classroom instruction, and students took learning tests immediately before and after tutoring sessions as well as delayed tests 1-2 weeks later. Analysis using logistic mixed-effects modeling indicates significant positive effects on the immediate post-test for the ITS (d =.71) and human tutors (d =.66) which are in the 99th percentile of meta-analytic effects, as well as significant positive effects on the delayed post-test for the ITS (d =.36) and human tutors (d =.39). We discuss implications for the role of expertise in tutoring and the design of future studies.
- North America > United States > Colorado > Boulder County > Boulder (0.14)
- North America > United States > Tennessee > Shelby County > Memphis (0.04)
- North America > United States > Virginia > Arlington County > Arlington (0.04)
- (6 more...)
- Research Report > New Finding (1.00)
- Instructional Material > Course Syllabus & Notes (1.00)
- Research Report > Experimental Study > Negative Result (0.47)
Comparing Human Expertise and Large Language Models Embeddings in Content Validity Assessment of Personality Tests
Milano, Nicola, Ponticorvo, Michela, Marocco, Davide
In this article we explore the application of Large Language Models (LLMs) in assessing the content validity of psychometric instruments, focusing on the Big Five Questionnaire (BFQ) and Big Five Inventory (BFI). Content validity, a cornerstone of test construction, ensures that psychological measures adequately cover their intended constructs. Using both human expert evaluations and advanced LLMs, we compared the accuracy of semantic item-construct alignment. Graduate psychology students employed the Content Validity Ratio (CVR) to rate test items, forming the human baseline. In parallel, state-of-the-art LLMs, including multilingual and fine-tuned models, analyzed item embeddings to predict construct mappings. The results reveal distinct strengths and limitations of human and AI approaches. Human validators excelled in aligning the behaviorally rich BFQ items, while LLMs performed better with the linguistically concise BFI items. Training strategies significantly influenced LLM performance, with models tailored for lexical relationships outperforming general-purpose LLMs. Here we highlights the complementary potential of hybrid validation systems that integrate human expertise and AI precision. The findings underscore the transformative role of LLMs in psychological assessment, paving the way for scalable, objective, and robust test development methodologies.
Large Language Models Often Say One Thing and Do Another
Xu, Ruoxi, Lin, Hongyu, Han, Xianpei, Zheng, Jia, Zhou, Weixiang, Sun, Le, Sun, Yingfei
As large language models (LLMs) increasingly become central to various applications and interact with diverse user populations, ensuring their reliable and consistent performance is becoming more important. This paper explores a critical issue in assessing the reliability of LLMs: the consistency between their words and deeds. To quantitatively explore this consistency, we developed a novel evaluation benchmark called the Words and Deeds Consistency Test (WDCT). The benchmark establishes a strict correspondence between word-based and deed-based questions across different domains, including opinion vs. action, non-ethical value vs. action, ethical value vs. action, and theory vs. application. The evaluation results reveal a widespread inconsistency between words and deeds across different LLMs and domains. Subsequently, we conducted experiments with either word alignment or deed alignment to observe their impact on the other aspect. The experimental results indicate that alignment only on words or deeds poorly and unpredictably influences the other aspect. This supports our hypothesis that the underlying knowledge guiding LLMs' word or deed choices is not contained within a unified space. In recent years, large language models (LLMs) have become more prevalent in various practical applications, such as grounded planning (Dagan et al., 2023; Song et al., 2023).
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.68)
QG-SMS: Enhancing Test Item Analysis via Student Modeling and Simulation
Nguyen, Bang, Du, Tingting, Yu, Mengxia, Angrave, Lawrence, Jiang, Meng
While the Question Generation (QG) task has been increasingly adopted in educational assessments, its evaluation remains limited by approaches that lack a clear connection to the educational values of test items. In this work, we introduce test item analysis, a method frequently used by educators to assess test question quality, into QG evaluation. Specifically, we construct pairs of candidate questions that differ in quality across dimensions such as topic coverage, item difficulty, item discrimination, and distractor efficiency. We then examine whether existing QG evaluation approaches can effectively distinguish these differences. Our findings reveal significant shortcomings in these approaches with respect to accurately assessing test item quality in relation to student performance. To address this gap, we propose a novel QG evaluation framework, QG-SMS, which leverages Large Language Model for Student Modeling and Simulation to perform test item analysis. As demonstrated in our extensive experiments and human evaluation study, the additional perspectives introduced by the simulated student profiles lead to a more effective and robust assessment of test items.
- North America > Canada > Ontario > Toronto (0.14)
- Asia > Middle East > UAE (0.14)
- North America > United States > Wisconsin (0.14)
- (5 more...)
- Research Report > New Finding (0.34)
- Instructional Material > Course Syllabus & Notes (0.31)
- Education > Educational Technology > Educational Software (0.61)
- Education > Assessment & Standards > Student Performance (0.51)